Preprocessing with a Tokenizer

>> What is preprocessing?
  
  - Preprocessing involves converting raw text into a format that a model can understand. This usually means turning text into numbers.


In our example:

  1- Text Inputs: We start with sentences:

python code:

'''
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
'''

  2- Tokenization: We use the tokenizer to convert these sentences into numerical format. First, we need to download the tokenizer:

python code:

'''
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
'''

  3- Tokenize and Convert to Tensors: We pass our sentences to the tokenizer, specifying how we want the tensors:

python code:

'''
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
'''

The inputs dictionary contains:

- input_ids: Numerical representations of the tokens.
- attention_mask: Indicates which tokens are real and which are padding.

Example output:
{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}